Reshaping and Joining Data Frames

In this lesson, you will learn:


▶️ Watch Videos: 25 minutes

📖 Readings: 30-60 minutes

💻 Activities: 45-60 minutes

✅ Check-ins: 3


1 Code Style

Check-in: Code Style

You’ve been spending time and getting feedback on the formatting of your code. Based on what you’ve read, address the following questions.

Question 1: Based on the tidyverse style guide, where should whitespace (blank spaces) go?

  • Before and after every + sign

  • Before and after every %>%

  • Before and after every athematic operator (+, -, *, /)

  • After every comma

  • Before every comma

Question 2: How many characters per line should you strive to have in your R code (and RMarkdown files)?

Question 3: Based on the styleguide, which of the following are bad object names?

  • data

  • cereals_data

  • cereals.data

  • cerealsData

  • cerealsdata

  • cereals_data1

  • cereals_data_clean

2 Reshaping Data

2.1 Tidy Data and Reshaping

📖 Required Reading: R4DS Chapter 6 (Data tidying)

▶️ Required Video: Data Layouts – 7 minutes

▶️ Required Video: Pivoting Longer – 9 minutes

▶️ Required Video: Pivoting Wider – 9 minutes

Check-in 4.1: Pivoting

Download the cereal.csv data here.

Question 1: Create a new dataset called cereals_long, that has three columns:

  • The name of the cereal

  • A column called Nutrient with values "protein", "fat", or "fiber".

  • A column called Amount with the corresponding amount of the nutrient.

3 Joining data

📖 Required Reading: R4DS Chapter 21 (Joins)

Note

Because these chapters do a much better job visualizing the concepts, I’ve chosen not to record additional videos.

Check-in: Practice with Joins

The following code creates three datasets / tibbles:

prof_info <- tibble(
  professor = c("Bodwin", 
                "Glanz", 
                "Robinson",
                "Theobold"),
  undergrad_school = c("Harvard", 
                       "Cal Poly",
                       "Winona State University",
                       "Colorado Mesa University"),
  grad_school = c("UNC", 
                  "Boston University", 
                  "University of Nebraska-Lincoln",
                  "Montana State University")
)

prof_course <- tibble(
  professor = c("Bodwin", "Robinson", "Theobold", "Carlton"),
  Stat_331 = c(TRUE, TRUE, TRUE, TRUE),
  Stat_330 = c(FALSE, TRUE, FALSE, FALSE),
  Stat_431 = c(TRUE, TRUE, TRUE, FALSE)
)

course_info <- data.frame(
  course = c("Stat_331", "Stat_330", "Stat_431"),
  num_sections = c(8, 3, 1)
)

Here is what they look like once created:

prof_info

prof_info
# A tibble: 4 × 3
  professor undergrad_school         grad_school                   
  <chr>     <chr>                    <chr>                         
1 Bodwin    Harvard                  UNC                           
2 Glanz     Cal Poly                 Boston University             
3 Robinson  Winona State University  University of Nebraska-Lincoln
4 Theobold  Colorado Mesa University Montana State University      
prof_course 
# A tibble: 4 × 4
  professor Stat_331 Stat_330 Stat_431
  <chr>     <lgl>    <lgl>    <lgl>   
1 Bodwin    TRUE     FALSE    TRUE    
2 Robinson  TRUE     TRUE     TRUE    
3 Theobold  TRUE     FALSE    TRUE    
4 Carlton   TRUE     FALSE    FALSE   
course_info
    course num_sections
1 Stat_331            8
2 Stat_330            3
3 Stat_431            1

These datasets contain information about five Cal Poly professors, their educational history, the classes they are able to teach, and the number of sections of each class that need to be assigned.

a) Combine datasets 1 and 2 to make this dataset:

  <chr>     <chr>                    <chr>                          <lgl>    <lgl>    <lgl>   
1 Bodwin    Harvard                  UNC                            TRUE     FALSE    TRUE    
2 Glanz     Cal Poly                 Boston University              TRUE     TRUE     TRUE    
3 Robinson  Winona State University  University of Nebraska-Lincoln TRUE     TRUE     TRUE    
4 Theobold  Colorado Mesa University Montana State University       TRUE     FALSE    TRUE 

b) Combine datasets 1 and 2 to make this dataset:

  professor Stat_331 Stat_330 Stat_431 undergrad_school         grad_school                   
  <chr>     <lgl>    <lgl>    <lgl>    <chr>                    <chr>                         
1 Bodwin    TRUE     FALSE    TRUE     Harvard                  UNC                           
2 Robinson  TRUE     TRUE     TRUE     Winona State University  University of Nebraska-Lincoln
3 Theobold  TRUE     FALSE    TRUE     Colorado Mesa University Montana State University      
4 Carlton   TRUE     FALSE    FALSE    NA                       NA      

c) Combine datasets 2 and 3 to make this dataset:

   professor course   can_teach num_sections
   <chr>     <chr>    <lgl>            <dbl>
 1 Bodwin    Stat_331 TRUE                 8
 2 Bodwin    Stat_330 FALSE                3
 3 Bodwin    Stat_431 TRUE                 1
 4 Robinson  Stat_331 TRUE                 8
 5 Robinson  Stat_330 TRUE                 3
 6 Robinson  Stat_431 TRUE                 1
 7 Theobold  Stat_331 TRUE                 8
 8 Theobold  Stat_330 FALSE                3
 9 Theobold  Stat_431 TRUE                 1
10 Carlton   Stat_331 TRUE                 8
11 Carlton   Stat_330 FALSE                3
12 Carlton   Stat_431 FALSE                1